# Enter the Elasticsearch host below (pinned to the COVID Twitter Slack channel):

elasticsearch_host <- "lp01.idea.rpi.edu"

Overview

The motivation of this COVID Twitter study is to examine prominent discussion themes semantically related to vaccine skepticism, or anti-vax sentiment, on Twitter from mid-March to mid-April 2020, one of the most pivotal months in the evolution of the coronavirus pandemic.

Specifically, the following semantic phrase is queried: “I would not get a vaccine for coronavirus. Vaccines are fake, and vaccination doesn’t actually work.”

Methodology

In addition to exploring the extent to which this notebook returns relevant results for a semantic search query, I also wanted to experiment with the kmeans functionality by which the program generates similarity clusters.

The kmeans function in R takes, by default, arguments for the dataset to study, the number of clusters ‘k’, and the maximum number of iterations to perform. In my notebook, I elected to experiment with one optional kmeans parameter, ‘nstart’: the number of random initial cluster assignments the function tries. The R documentation (and various additional literature) suggests that increasing the number of initializations helps the function home in on the optimal cluster organization for the data, so I wanted to see how differently the cluster plots would actually render if I did so.
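The effect of ‘nstart’ can be illustrated with a minimal self-contained sketch on toy 2-D data (not the notebook's tweet embeddings); kmeans keeps the run with the lowest total within-cluster sum of squares, so additional starts can only match or improve that value:

```r
# Toy sketch (not the notebook's pipeline): compare kmeans with the default
# single random start against nstart = 30 restarts on synthetic 2-D data.
set.seed(42)
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

km_single <- kmeans(pts, centers = 3)               # nstart = 1 (default)
km_multi  <- kmeans(pts, centers = 3, nstart = 30)  # best of 30 random starts

# Compare the objective kmeans minimizes across the two fits:
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
```

On well-separated toy data both fits usually reach the same solution; on high-dimensional embedding vectors the multi-start fit is much less likely to get stuck in a poor local optimum.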

Moreover, in running this notebook, the number of clusters for k-means to generate must be selected manually via elbow-plot inspection (at least until a more refined or automated means of selecting the optimal number of clusters is implemented, per the notebook author). In light of this, for my specific semantic search inquiry, I select and designate new values for the cluster number (k) accordingly.
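The elbow-plot procedure can be sketched as follows; `elbow_wss` is a hypothetical helper (the notebook has its own plotting code), and the toy matrix stands in for the tweet embedding vectors:

```r
# Hypothetical helper: total within-cluster sum of squares for k = 1..max_k.
elbow_wss <- function(data, max_k = 10, nstart = 30) {
  sapply(seq_len(max_k), function(k) {
    kmeans(data, centers = k, nstart = nstart)$tot.withinss
  })
}

set.seed(1)
toy <- matrix(rnorm(400), ncol = 2)  # stand-in for embedding vectors
wss <- elbow_wss(toy)

# The "elbow" is where adding clusters stops sharply reducing the WSS.
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")
```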

The dataset

This notebook draws from the cure-and-prevention-classified Elasticsearch index ‘covidevents-data’, adapted from the following source: Extracting COVID-19 Events from Twitter

Analysis methods

We use R’s ‘kmeans’ function with the default Hartigan-Wong clustering algorithm, setting nstart = 30 instead of the default value of 1. See the original Rmd ‘covid-twitter-hacl-template’ for a comparison with the master cluster plot generated there.

Results

Query setup

# query start date/time (inclusive)
rangestart <- "2020-03-17 00:00:00"

# query end date/time (exclusive)
rangeend <- "2020-04-16 00:00:00"

# query semantic similarity phrase
semantic_phrase <- "I would not get a vaccine for coronavirus. Vaccines are fake, and vaccination doesn't actually work."

# return results in chronological order or as a random sample within the range
# (ignored if semantic_phrase is not blank)
random_sample <- FALSE

# number of results to return (max 10,000)
#**author suggests no more than 1000 for reasonable runtime:
resultsize <- 1000

Selection of optimal number of clusters and subclusters

To find the optimal number of high-level theme clusters for this sample, an elbow plot is used:

The plot is mostly a smooth curve, but a distinct “elbow” is visible around k = 5, so I choose this value of k.

k <- 5

To find the optimal number of topic subclusters for each theme cluster, another elbow plot is generated with a separate curve for each theme cluster:

Each theme cluster follows a similar, mostly smooth curve. This time an “elbow” appears at approximately k = 4, so this value is chosen for the topic subclusters.

cluster.k <- 4
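The per-theme-cluster elbow curves can be reproduced in a self-contained sketch; `vectors` and `theme` below are stand-ins for the notebook's tweet embeddings and top-level cluster assignments:

```r
set.seed(7)
vectors <- matrix(rnorm(300 * 4), ncol = 4)               # stand-in embeddings
theme   <- kmeans(vectors, centers = 5, nstart = 30)$cluster

max_k <- 8
# One WSS-vs-k column per theme cluster, computed on that cluster's rows only.
wss_by_theme <- sapply(1:5, function(cl) {
  sub <- vectors[theme == cl, , drop = FALSE]
  sapply(seq_len(max_k), function(k) {
    kmeans(sub, centers = k, nstart = 30)$tot.withinss
  })
})

# One curve per theme cluster on a shared set of axes.
matplot(seq_len(max_k), wss_by_theme, type = "b", pch = 1,
        xlab = "Number of subclusters k",
        ylab = "Within-cluster sum of squares",
        main = "Elbow curves per theme cluster")
```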

Visualization of theme clusters and topic subclusters

Discussion

This exploratory notebook run shows that increasing ‘nstart’ (the number of random initializations tried for the chosen number of clusters k) in the kmeans implementation has a significant influence on the cluster output.

Interestingly, the semantic search results are not very clear about the extent to which they surface major themes illustrating public attitudes toward, or preoccupation with, vaccine skepticism. This could point to weaknesses in the topic-labelling mechanism currently employed to generate labels for the cluster plots (based on word frequencies).

An alternative hypothesis is that my semantic search query is not specific enough to highlight disbelief in, or negative attitudes toward, vaccines. Either way, this is a matter for follow-up experimentation, such as comparing similar but more singularly themed and specifically worded semantic searches.

Limitations

Regarding the limitations of this analysis, significant restrictions on the search query date range and the number of tweets were necessary due to storage and processing constraints. Ideally, it would be more insightful to explore trends over a longer period of time, especially in light of recent resurgences and recurring COVID-related discussion on Twitter, or to analyze a larger body of tweets for a clearer overall illustration of clustering trends. I intend to follow up on this subject with the lab group.

The barchart generated from the contents of tweet.vectors.df shows that geo-tagged tweets (tweets for which the ‘Place’ level of the user_location_type factor is populated) constitute a distinct minority of our sample. This feature of the data will be taken into consideration in any future spatial analysis of Twitter discourse.
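The proportion check behind that barchart can be sketched with invented counts (the real tweet.vectors.df comes from the query above; the data frame name and numbers here are purely illustrative):

```r
# Illustrative stand-in for the location column of tweet.vectors.df;
# the counts below are invented for the sketch.
tweet.location.df <- data.frame(
  user_location_type = factor(rep(c("Place", "Profile", "Unknown"),
                                  times = c(30, 270, 700)))
)

counts <- table(tweet.location.df$user_location_type)
barplot(counts, xlab = "user_location_type", ylab = "Tweet count")

# Geo-tagged ("Place") tweets as a share of the sample:
prop.table(counts)["Place"]   # 0.03 with these invented counts
```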

Future Prospects

***Moving forward, I aim to either a) develop methods of analyzing these plots with greater statistical rigor, or b) simply extract the raw data from the semantic ordering itself, pre-clustering, in order to draw firmer conclusions from these semantic queries, especially with regard to what the relevant trending Twitter topics can tell us about the state of the pandemic and its societal impacts.

***Brainstorming:

- Maybe conduct PCA / generate a biplot, to better characterize and separate individual subtopics?
- Could we do something with the locations? Maybe separate out only the Twitter IDs with associated locations, run cluster analyses within a given set of location IDs that specify the same geographic region, and study COVID-related social media discussion trends by region?
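The PCA/biplot idea can be sketched in base R with prcomp on stand-in embedding data (the real input would be the tweet vectors, and the dimension names here are invented):

```r
set.seed(3)
emb <- matrix(rnorm(100 * 6), ncol = 6,
              dimnames = list(NULL, paste0("dim", 1:6)))

pca <- prcomp(emb, scale. = TRUE)    # center and scale, then rotate
summary(pca)                         # variance explained per component

# Biplot of the first two principal components: points are tweets,
# arrows are the original embedding dimensions.
biplot(pca, cex = 0.6)
```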

References

[1] https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans

[2] https://blog.exploratory.io/visualizing-k-means-clustering-results-to-understand-the-characteristics-of-clusters-better-b0226fb3dd10